The ChoiceMaker 2 Record Matching System
Abstract
This paper describes the key features of ChoiceMaker 2, an innovative record matching system developed by ChoiceMaker Technologies (CMT). We begin with an overview of the stages that a record matching system goes through to find an incoming “query record” in a database, and then consider the stages one by one. We sketch our patent-pending process for identifying possible matches to the query record, which is known as “blocking”. We describe how we use a machine learning technique known as maximum entropy modeling to tune the system to the problem at hand. Next we describe the ClueMaker programming language that CMT has developed for describing record matching characteristics. We then describe our method for testing record matching models, how our IDE facilitates this process, and the process by which we develop such models. Finally, we discuss systems integration issues and the interfaces that ChoiceMaker offers for deployment.

1. APPROXIMATE RECORD MATCHING

Approximate record matching is employed when records cannot always be identified by a reliable unique key. Record matching tasks can be broken down into three main categories:

• Duplicate record removal or linkage: The same person, business, or thing is present more than once in a database. Duplicate records are removed or linked together.
• Database linkage: Two databases are linked or merged. This might occur, for instance, because of a corporate merger, to build a data warehouse, or to prevent duplication of effort by storing information common to multiple databases in a single enterprise-wide database.
• Approximate database search: A database is searched for records similar to an input record. A related use is to prevent users from adding duplicate records by checking in real time whether a record entered on a data entry screen is already present in the database.

In all of these cases the basic problem is essentially the same:
Given an input or query record, search a target database for the record or records that denote the same thing (e.g., the same person, product, or company).

1.1 Matching Process Overview

Advanced approximate record matching systems generally perform matching as a two-step process, blocking followed by decision making. Figure 1 illustrates the flow:

1. Query record. A query record is sent to the matching engine.
2. Blocking. The engine searches the target database for records that are possible matches to the query record. The objective at this stage is to retrieve all possible matches while retrieving as few non-matches as practical.
3. Many possible matches. Blocking returns a set of records that are possible matches to the query record.
4. Decision making. For each possible match, the matching engine determines the probability that the record denotes the same thing as the query record. Possible matches are sorted into matches, potential matches, and non-matches based on two user-defined thresholds. For example, any record that matches the query record with a probability higher than the “match threshold” is declared a “match”.

Figure 1: ChoiceMaker Matching Overview
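The two steps above can be illustrated with a minimal Python sketch. It is not ChoiceMaker's implementation: the blocking rule (shared last-name initial or ZIP code), the field-agreement scorer, the `Record` fields, the name `lower_threshold`, and both threshold values are assumptions made for illustration. In particular, the source names only the “match threshold”; treating records below a second, lower threshold as non-matches and those in between as potential matches is inferred from the three categories listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical schema; real records would carry many more attributes.
    record_id: int
    first_name: str
    last_name: str
    zip_code: str


def block(query, database):
    """Toy blocking step: keep records that share a last-name initial
    or a ZIP code with the query (an assumed rule, not CMT's process)."""
    initial = query.last_name[:1].lower()
    return [
        r for r in database
        if r.last_name[:1].lower() == initial or r.zip_code == query.zip_code
    ]


def match_probability(query, candidate):
    """Placeholder scorer: fraction of fields that agree exactly.
    A real system would use a trained model here."""
    fields = ("first_name", "last_name", "zip_code")
    agree = sum(
        getattr(query, f).lower() == getattr(candidate, f).lower()
        for f in fields
    )
    return agree / len(fields)


def decide(probability, match_threshold=0.8, lower_threshold=0.3):
    """Sort a probability into one of three outcomes using two thresholds.
    Threshold values and the name lower_threshold are illustrative."""
    if probability > match_threshold:
        return "match"
    if probability < lower_threshold:
        return "non-match"
    return "potential match"


def find_matches(query, database):
    """Run blocking, then decision making, and return
    (record_id, probability, decision) sorted by descending probability."""
    results = []
    for candidate in block(query, database):
        p = match_probability(query, candidate)
        results.append((candidate.record_id, p, decide(p)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling `find_matches(query, database)` with a `Record` and a list of `Record` objects returns only the candidates that survive blocking, each labeled with its decision. Keeping blocking and scoring as separate functions mirrors the two-step structure: the blocking rule can be loosened to improve recall without touching the decision logic.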
Publication date: 2005